A Comparative Analysis of Discrete Entropy Estimators for Large-Alphabet Problems

This paper presents a comparative study of entropy estimation in a large-alphabet regime. A variety of entropy estimators have been proposed over the years, where each estimator is designed for a different setup with its own strengths and caveats. As a consequence, no estimator is known to be universally better than the others. This work addresses this gap by comparing twenty-one entropy estimators in the studied regime, starting with the simplest plug-in estimator and leading up to the most recent neural network-based and polynomial approximate estimators. Our findings show that the estimators’ performance highly depends on the underlying distribution. Specifically, we distinguish between three types of distributions, ranging from uniform to degenerate distributions. For each class of distribution, we recommend the most suitable estimator. Further, we propose a sample-dependent approach, which again considers three classes of distribution, and report the top-performing estimators in each class. This approach provides a data-dependent framework for choosing the desired estimator in practical setups.


Introduction
Entropy estimation has long been a central area of research, driven by its role as a metric for measuring the uncertainty of source information [1].One persistent challenge is the estimation of entropy in scenarios involving a large alphabet and a small sample size.Such a scenario can occur, for example, in image recognition, where symbols represent RGB values.This setup is typically referred to as the large-alphabet regime, where the entropy estimators are shown to be biased [2] and the convergence rate can be slow [3].
Entropy estimation is used in a variety of fields, such as machine learning, cryptography, and data compression.Noteworthy applications include feature selection in machine learning [4] and the development and analysis of encryption methods in cryptography, particularly in the task of assessing entropy based on small sample sizes to obtain an estimator with minimal mean square error [5,6].Additionally, in natural language processing, a compelling application arises in the form of word-sense induction, which is a technique used for word clustering [7].For instance, the SemEval 2010 WSI task commonly exhibits a small average number of examples per word, while the count of sense clusters may be substantially higher, sometimes exceeding ten clusters per word in certain systems.
A variety of entropy estimators proposed in different research studies exhibit diverse performance in distinct scenarios [8].Notably, in the large-alphabet regime, numerous studies have been conducted [9][10][11][12][13][14].This research seeks to build upon these studies by analyzing the latest approach to entropy estimation using deep neural networks, with a specific emphasis on the large-alphabet regime, which poses challenges to conventional entropy estimation methods, such as those that rely on the plug-in principle [2,15,16].
The primary focus lies in entropy estimation using small samples drawn from multiple distributions, spanning from nearly deterministic to uniform.
This study extends the findings of a prior comparative analysis involving eighteen entropy estimators, as presented in [17], by incorporating two novel state-of-the-art estimators that are based on deep neural networks (DNNs) and polynomial approximation.The DNN-based estimator defined in [18] performs well in practice, while the polynomial estimator comes with many favorable performance guarantees [19].Consequently, it focuses on a broader variety of large-alphabet regimes across a wide range of distributions.Finally, our study provides guidance for selecting the most favorable entropy estimator for different setups.
This paper is organized as follows: Section 2 outlines the preliminaries, including fundamental entropy definitions and tools.Section 3 delves into the various entropy estimators and associated comparison studies.Section 4 details the experimental settings in this study, while Section 5 presents the results of the analysis using a variety of statistical measures.Lastly, Section 6 concludes with insights gleaned from the analysis and potential directions for future research.

Preliminaries
Shannon entropy serves as an information-theoretic metric for evaluating uncertainty in a random variable.For a discrete random variable X with a given distribution P = (p 1 , p 2 , ..., p k ) with an alphabet X of size |X | = k, Shannon entropy is defined as Given a collection of n iid samples from X, denoted by X n = {X i } n i=1 , our goal is to estimate H(X) from the sample, Ĥ(X n ).The empirical distribution is defined by P = ( p1 , p2 , ..., pk ), where each sampled probability follows px = ∑ n i=1 1(X i = x)/n, where 1() is the indicator function.
To assess the accuracy of the studied estimators, we focus on the mean squared error (MSE) between the entropy and its empirical estimation, which is a popular measure for comparison, as noted in [17][18][19].This measure includes the bias error and the variance error, making it an ideal candidate for measuring the entropy estimator's quality.The MSE satisfies where the bias of the estimator Ĥ(X) is defined as For unbiased estimators, the MSE represents the variance of the estimator, while in the case of a small-variance estimator, as seen in distributions closer to uniform, the bias significantly affects the MSE calculation.
Although the MSE offers a reliable method for estimating the bias and variance of estimators, the use of root MSE (RMSE) presents an additional advantage.Namely, it amplifies the differentiation between the estimators, especially when the differences are small, and expresses the error in the same unit as the entropy (bits).Our analysis involves evaluating the RMSE of one hundred measurements of entropy estimation for each combination of entropy estimator, sample size, alphabet size, and distribution.

Overview of Entropy Estimators
Over the years, a variety of entropy estimators have been introduced.This section presents a review of twenty-one entropy estimators recently introduced in various studies, while the explicit formulas of these entropy estimators are presented in Appendix A. The maximum likelihood estimator (plug-in) [20] is designed by using the entropy directly calculated from the empirical distribution It exhibits commendable results within the classical regime, typified by a large sample size and a small alphabet size.However, a significant negative bias is observed as the alphabet size increases [2].In response to this bias, several estimators were developed, including the Miller-Madow correction (MM) [15], which corrects the bias by incorporating a constant dependent on the non-zero sample probability count.Additionally, the jackknife estimator [16] proposes a correction based on estimation using the plug-in on all samples, excluding the j th sample.These corrections provide notable improvements for slight deviations from the classical regime, but as the sample-to-alphabet size ratio (STA ratio) decreases, these methods exhibit a large bias.Building on these is the Best Upper Bound (BUB) estimator [2], which takes a more systematic approach by approximating the optimal polynomial to H(X) within the space of n-degree polynomials, where n is the number of samples.This space precisely corresponds to the class of estimators that, like the plug-in, are linear in histogram order statistics.This estimator demonstrates superior performance over previous plug-in-based estimators when dealing with a small sample size and a small STA ratio.Additionally, it is worth noting that all the estimators mentioned up to this point are non-Bayesian.Another non-Bayesian alternative is the Grassberger entropy estimator (GSB) [10].This estimator demonstrates improved computation time, as it closely resembles the plug-in estimator, with the distinction that the logarithms are substituted with a G n function of the form where ψ(•) is the digamma function, and the function is specified for integer values and can be precomputed through recursion.Although the GSB estimator is generally considered a reasonable trade-off between bias and variance, the Schürmann (SHU) [21] estimator has shown that enhancements, particularly in terms of bias reduction at the expense of increased variance, can be achieved by generalizing the G n function to a one-parameter family of functions, denoted by G n (ζ).
The Chao-Shen estimator (CS) [22], also known as the coverage-adjusted estimator (CAE) [23], estimates the entropy by considering it as the summation of an unknown population H(X) = ∑ k b k , where each element in this population is defined by b k = −p k log(p k ).Later on, it utilizes the Horvitz and Thompson estimator [24] to provide an estimation for the total population.Specifically designed for scenarios with small sample sizes, the CS estimator is also capable of handling dependent observations.One more estimator in use is the James-Stein shrinkage (SHR) estimator [9].This estimator adopts a unique strategy by averaging two dissimilar models: a high-dimensional model with low bias and high variance and a lower-dimensional model with a higher bias and lower variance.The regularization level is controlled by the relative weights assigned to these two models.To achieve this, a convex function is applied to the empirical distribution where λ ∈ [0, 1] represents the shrinkage intensity, ranging from zero (no shrinkage) to one (full shrinkage), while t k denotes the shrinkage target, commonly defined as the probability of a uniform distribution.This estimator is designed to be effective in both the largealphabet regime and the classical regime.Additionally, it can transform into one of the Bayesian estimators when there are variations in the parameters t k and λ.The Bonachela (BN) estimator, as introduced in [25], is designed for scenarios marked by a small sample size and an STA ratio greater than one.In such cases, there is typically a relatively small alphabet size, where the empirical probabilities are not negligible.The primary goal of this estimator is to simultaneously minimize bias and variance across a broad range of probabilities.This approach strikes a balance between minimizing bias and addressing variance, which is particularly crucial when analyzing small sample sizes characterized by significant statistical fluctuations.Notably, the BN estimator is recognized for its numerical simplicity in implementation.
Built upon the Good-Turing formula [26,27], the Zhang entropy estimator [12] focuses on recovering distributional characteristics within the subset of the alphabet not covered by the sample size n.This approach leads to a notable increase in estimation accuracy compared to the plug-in estimator for any distribution with finite entropy.Moreover, the proposed estimator exhibits bias decay that is exponential in n.In cases of an infinite alphabet, the rate of bias decay is influenced by the distribution's tail behavior.An improvement for this estimator was given in [28].
The Chao-Wang-Jost estimator (CWJ) [14] takes a novel approach by reformulating Shannon entropy in terms of the expected discovery rates of new species relative to the sample size, represented by the successive slopes of the species accumulation curve.The estimator is derived by applying slope estimators obtained from an improved Good-Turing frequency formula [26].In evaluations conducted on finite alphabet sizes with an STA ratio greater than 0.1, the CWJ estimator demonstrated superior performance compared to the CS, GSB, Zhang, and jackknife estimators [14].
Within Bayesian statistics, significant attention has been devoted to the selection of priors for entropy estimation.Jeffrey's (JEF) prior [29], which is a symmetric Dirichlet distribution with the parameter a x = 1/2, has been demonstrated to asymptotically maximize Shannon's mutual information between X n and X [30].The Laplace (LAP) estimator [31], derived from the Bayes estimator of the Tsallis entropy under a uniform prior probability density (a x = 1), presents a modified perspective on the JEF estimator.The Schürmann-Grassberger (SG) estimator [32] extends the LAP estimator and numerically identifies that the most accurate estimates are achieved using a symmetric Dirichlet distribution with the parameter a x = 1/k as the prior.Building upon these, the minimax prior (MIN) estimator [33] formulates the estimation problem as a risk function, aiming to minimize the guaranteed value of the estimate.In the case of solving for a multivariate hypergeometric distribution, the Bayes estimator with a x = √ n/k as the prior is identified as the optimal solution.Additionally, the Nemenman-Schafee-Bialek (NSB) estimator [11] was also developed within the Bayesian framework, extending considerations to priors with a power-law dependence on probabilities, specifically within the Dirichlet family of priors.
The CDM estimator [34], short for centered Dirichlet mixture, serves as the prior for the Bayesian entropy estimator.It centers a Dirichlet distribution over all conceivable binary words around either an independent Bernoulli (DBer) or a synchrony (DSyn) distribution.Initially designed for estimating the entropy of neural spike trains, it has been extended to generalize to binary vector data.In comparison, the PYM estimator [35] is based on the Pitman-Yor mixture prior, implying a narrow prior distribution over H. Significantly, it has been demonstrated that this estimator remains consistent across a variety of distributions, particularly excelling in providing optimal estimations for distributions characterized by long-tail behavior.
The polynomial estimator [19] was developed through the approximation of entropy using the polynomial representation of variables in the form ϕ(x) = −x log x.This method achieves a balance between the plug-in estimation and the polynomial approximation by evenly splitting the sample and incorporating observed frequencies in each subset.Additionally, the unseen estimator [13] was suggested through a linear programming-based approach that leverages the sample to characterize the "unseen" segment of the distribution.Without making a priori assumptions about the distribution, the identification of unseen domain elements becomes inherently uncertain.Nonetheless, there is an effort to estimate the "shape" or histogram of the unseen part of the distribution, essentially quantifying the occurrence of unseen domain elements within various probability ranges.With such a reconstruction, the entropy of the distribution, dependent solely on the shape/histogram, can be estimated.Both of these estimators were specifically designed for large-alphabet regimes with varying STA ratios.
Finally, the neural joint entropy estimator (NJ) [18] introduces an innovative solution by employing a neural network-based entropy estimator.It minimizes the cross-entropy loss and derives the estimated entropy by individually estimating each element of the sum of conditional entropies.This estimator is specifically tailored to address large-alphabet regimes and small STA ratios.
Besides the entropy estimators already mentioned, the literature also includes several other schemes, including KNN [36], KDE [37], B-Spline [38], and Edgeworth [39] estimators.However, many of these are not adequate for the large-alphabet regime; hence, they are not included in this study.Additionally, certain neural network-based estimators, primarily designed for estimating mutual information rather than directly measuring entropy, such as MINE [40], JS [41], and SMILE [42], are also excluded, as they fall outside the scope of this work.

Past Research on Comparison of Entropy Estimators
This section reviews earlier comparative research on entropy estimators in the largealphabet regime, which is different from studies focusing on the classical regime [25].It also differs from studies that introduce a new estimator and offer a brief comparison to others, such as in [12][13][14]18,19].Starting with the analysis presented in [2], four estimators (plug-in, MM, BUB, and jackknife) were applied to both real and simulated neuron spike data.In the classical regime, convergence was observed among the plug-in, MM, and jackknife estimators as the sample size increased.However, as the settings changed to the large-alphabet regime, the bias increased and the BUB estimator consistently outperformed the others.
In [9], nine estimators (plug-in, MM, NSB, CS, SHR, MIN, SG, JEF, and LAP) were analyzed on a variety of Dirichlet and Zipf's Law distributions.It was shown that NSB, CS, and SHR outperformed the rest, exhibiting comparable results.Furthermore, the paper explores the Bayesian-based entropy estimators and emphasizes the significance of selecting an appropriate prior distribution with the parameter a x , asserting that an improper choice of prior can result in poorer performance than the plug-in estimator.
The study in [17] examined eighteen estimators, namely, MIN, LAP, SG, JEF, CS, unseen, plug-in, jackknife, NSB, BN, CDM, GSB, SHR, BUB, Zhang, CJ, MM, and SHU, by analyzing them on samples from a uniform distribution with large and small alphabet sizes.The findings revealed that the SHR estimator outperformed the others for samples from the large alphabet size.However, for samples from the small alphabet size, the use of both the MM and SHR estimators was suggested.Furthermore, the study demonstrated that the SHR estimator achieved the least bias and also attained the lowest MSE, which approached zero.
In summary, although research has been conducted on the comparison of entropy estimators in the large-alphabet regime, this study sets itself apart in several key aspects.First, this work focuses on two novel state-of-the-art estimators, namely, NJ (DNN-based) and polynomial estimators.Second, it examines a broader range of distributions, ranging from uniform to deterministic.Last, the study introduces novel conclusions for selecting favorable entropy estimators based on their empirical proximity to the uniform distribution, a concept not explored in prior research.

Experimental Settings
To ensure a comprehensive exploration of distributions ranging from uniform to deterministic, we examine different parametric distributions on a variety of parameter values.First, we examine the uniform distribution as a special case for estimating the limits of the estimators.Second, we examine the geometric distribution with the parameter p ∈ [0, 1], covering a wide range of distributions.Last, we examine the Zipf's Law distribution with the parameter α ∈ [0.001, 3.4].The upper limit is set to 3.4 since, above this value, the estimator's performance on the distributions shows consistent results.
The experiments cover a wide range of sample and alphabet sizes, separated into two main scenarios.In the first scenario, the alphabet size remains constant at k = 10 5 , with a varying sample size ranging from 100 to 10 4 , similar to previous studies [12,13,18,19].The second scenario, akin to prior studies [12][13][14], considers a varying alphabet size ranging from 100 to 10 5 and a fixed sample size of n = 1000.These scenarios, representative of the large-alphabet regime, are studied, and the performance of the entropy estimators is analyzed.

Varying Sample Size with Fixed Alphabet Size
The analysis of twenty-one entropy estimators using the previously specified experimental settings led to an initial conclusion that some estimators produce results that are either incompatible or redundant, reducing the total number of estimators to seventeen.The SHU estimator was excluded due to its similar behavior to GSB in the large-alphabet regime for the RMSE and bias analyses, thereby making its inclusion redundant for the comparison.The BN estimator was also eliminated, as it proved to be the least compatible compared to the other estimators in the analysis.Similarly, the Bayesian estimators, including JEF and LAP, were found to be unsuitable for the large-alphabet regime.These particular estimators exhibit a higher bias relative to the rest and result in a high RMSE when the distributions approach the deterministic end.
We begin our analysis with uniform and near-uniform distributions, such as the geometric (with p = 10 −5 ) and Zipf's Law (with α = 0.001) distributions, as shown in Figure 1.First, we observe that the SHR estimator outperforms all others in both the uniform and Zipf's Law distributions, surpassing the NJ and NSB estimators, while in the geometric distribution, the NJ and SHR estimators achieve comparable results.The results of the SHR estimator align with the findings of [17] for a uniform distribution.This behavior also extends to near-uniform distributions.NJ and NSB exhibit similar behavior, with NJ outperforming NSB by nearly one order of magnitude at sample sizes below 1000.Furthermore, the sharp point-wise improvements in the RMSE of the polynomial estimator can be attributed to its estimation initially intersecting and then surpassing the actual entropy from below.As the sample size increases, this leads to a convergence to a value lower than the true entropy.
Continuing the analysis of distributions deviating from the uniform distribution, referred to as the mid-uniform range, the NJ estimator consistently outperforms all others, often by an order of magnitude, while the NSB, unseen, jackknife, and BUB estimators present comparable results to the second best, as seen in Figure 2.This pattern continues in geometric distributions with p varying from 0.0001 to 0.01 and in Zipf's Law distributions where α varies from 0.4 to 1.4.This continues until the point where NJ loses its superiority and attains results similar to those of other estimators across various sample sizes, as will be further discussed in Section 5.4.This excellent performance is attributed to NJ's ability to generalize well, especially for small sample sizes [18].At the deterministic end of the distribution spectrum (Figure 3), the classical estimators and their modifications, such as Zhang, PYM, MM, SG, SHR, NSB, GSB, unseen, jackknife, CWJ, plug-in, CS, and CDM, converge to one another, outperforming all others across varying sample sizes.This range is defined by an effectively smaller alphabet size, given the low probability of the majority of values-a characteristic of long-tail distributions, resembling the classical regime.Notice that the polynomial, BUB, MIN, and NJ estimators are also not competitive within this range, suggesting that these estimators may not be well matched for deterministiclike distributions.
In conclusion, a summary of all the best-performing estimators for each distribution range is presented in Table 1.It can be seen that in the near-uniform distribution range, SHR, NJ, and NSB exhibit notable performance, where the SHR estimator outshines the other two.As the distributions deviate from uniform in the mid-uniform range, NJ emerges as a preferable choice.In the near-deterministic range, classical estimators based on the plug-in gradually converge, as shown in [17].This range is characterized by a relatively small alphabet size compared to the sample size, resembling the classical regime.

Varying Alphabet Size with Fixed Sample Size
In this setup, a variety of alphabet sizes are analyzed, all maintaining a constant sample size of n = 1000.Beginning with the uniform distribution, as depicted in Figure 4a, the SHR estimator emerges as the superior scheme, outstripping all other estimators by nearly an order of magnitude.Next, we observe that the NSB estimator exhibits the secondbest results.The estimators' performance is relatively stable across varying alphabet sizes.Notice that as the alphabet size increases, there is an increase in the RMSE of the estimators.This pattern is also reflected in the nearly uniform geometric distribution with p = 10 −5 in Figure 4b and Zipf's Law distribution with α = 0.001, as depicted in Figure 4c.The superior performance of the NSB, SHR, and NJ estimators is reflected across varying alphabet sizes.These results are in line with the findings of [9] regarding the excellent performance of the NSB and SHR estimators and the conclusions in [18] about NJ's robust generalization abilities in the large-alphabet regime.For the remaining estimators, the bias increases with the alphabet size, observable in GSB, plug-in, MM, jackknife, Zhang, SG, and MIN.These estimators also converge from a certain alphabet size.The performance of plug-in, MM, and jackknife is consistent with the findings in [2] within the large-alphabet regime.
As the distributions deviate from uniform in the mid-uniform range, the performance of NJ and NSB remains consistently superior across the alphabet range, as seen in Figure 5.In addition, the PYM, CDM, jackknife, and unseen estimators also show comparable patterns and attain the second-highest results as the size of the alphabet increases.However, in specific ranges, the SHR estimator shows superior performance.The SHR takes the lead particularly when dealing with smaller alphabet sizes, aligned with the findings of [17].In the near-deterministic range (Figure 6), as was shown in the previous section, convergence is exhibited by the majority of the estimators, including plug-in, MM, NSB, Zhang, CWJ, GSB, SG, unseen, CDM, and PYM, demonstrating similar patterns across the alphabet range.This convergence is largely consistent, resulting in an RMSE of less than 0.1 across the full alphabet range, and aligns with previous findings [2,35].However, as the alphabet size increases, the MIN shows a decline in performance, which can be attributed to its assumed prior, which depends more on the alphabet size than the sample values, resulting in an estimated probability similar to the uniform distribution in the large-alphabet regime.In the same vein, the RMSEs of the polynomial and BUB estimators increase when the alphabet size crosses the 1000 mark due to the former's lack of correct polynomial coefficients and the latter's bias.
To conclude, the most favorable estimators in each distribution range are similar to the ones obtained in Section 5.1 and presented in Table 1.This reveals that the SHR estimator significantly surpasses others in the near-uniform range, maintaining an almost steady RMSE of 0.001.In the mid-range, NJ shines with larger alphabet sizes, while SHR continues to lead for smaller ones.At the far-uniform end, the plug-in and other estimators designed for the classical regime converge and deliver top performance.The analysis of varying alphabet sizes strengthens the findings from Section 5.1, emphasizing that the performance of each estimator is more reliant on the distribution range than the alphabet size.An additional interesting conclusion is that the NJ estimator shows a steady performance across all experiments with varying alphabet sizes.This suggests that it is quite robust to the alphabet size.

Bias Analysis
Let us now study the bias term of the entropy estimators.From (3), it can be inferred that overestimating the actual entropy leads to a positive bias, while underestimating it results in a negative bias.As demonstrated in [2], the plug-in estimator is negatively biased in both the large-alphabet and classical regimes.For uniform distributions (Figure 7a), the NJ and SHR estimators demonstrate the least bias, nearly approaching zero, a result consistent with the findings in Sections 5.1 and 5.2.Geometric distributions are depicted in Figure 7b.The CDM, PYM, CS ,CWJ, NJ, NSB, and unseen estimators deliver the best outcomes, with the NJ estimator outperforming all others.This aligns with NJ's performance in [18] and the design of PYM to tackle long-tailed distributions [35], as well as the findings in Section 5.1 in the mid-and far-uniform distribution range.
In the case of the Zipf's Law distribution shown in Figure 7c, the CS, CWJ, NJ, NSB, unseen, and SHR estimators perform the best, with NJ showing only a slight positive bias.This mirrors the geometric distribution bias analysis, except that the SHR estimator performed significantly better, as it is more suited for Zipf's Law distributions, as outlined in [9].
In total, the NJ, NSB, CS, unseen, CWJ, CDM, and SHR estimators outperform all others, with NJ standing out as the superior one (Figure 7).Notably, PYM's performance on the geometric distribution also exhibits the minimum bias.This result is not surprising, as the method is tailored for distributions with long exponential or power-law tails [35].The largest bias is introduced by the polynomial estimator due to an inadequate polynomial fit.The remaining estimators exhibit a moderate bias size, which can vary depending on the setting.(c)

Distributions According to Parameter Analysis
We evaluate the performance of entropy estimators across the parameters of parametric distributions using the settings defined in Section 5.1.By examining the mean RMSE for each distribution, it can be seen that for each distribution family, distinct regions emerge where some estimators perform better than others.Specifically, within the geometric distributions, as depicted in Figure 8a, the SHR estimator excels over the others when p ≤ 10 −5 , in line with the findings in Section 5.1.This distribution, falling into the nearuniform range, mirrors results obtained for the uniform distribution in the previous section.For p values ranging from 0.0001 to 0.005, the NJ method outperforms all others, with NSB and unseen ranking as second best, as discussed in prior sections.When p exceeds 0.005, the distributions become more deterministic, resembling the classical regime.Here, NSB, CDM and CWJ converge and yield the best overall results within the near-deterministic range.
Continuing with the analysis of the Zipf's Law distribution family (Figure 8b), the SHR estimator stands out as the top performer when α ranges from 0.001 and 0.4, a finding consistent with the results in the previous sections in the near-uniform range.For α values between 0.4 and 1.4, the NJ estimator outperforms the rest, obtaining a slightly higher RMSE than the top three estimators in the previous range.This also corresponds to NJ's well-documented ability to generalize effectively [18].Intriguingly, SHR performs second best, as the distribution remains closer to the uniform distribution.However, beyond a certain point, PYM outperforms the others, as it is better adapted to long-tail exponential distributions [35].As α reaches 1.4 and beyond, the classical estimators converge, with the best performers listed in Table 1 in the near-deterministic range.
Overall, the analysis of the two distribution families reveals similar results, which are consistent with the findings in previous sections.Each type of distribution can be divided into three primary ranges: the near-uniform, mid-uniform, and far-uniform ranges.Within each range, the top-performing estimators tend to be similar, as described in Table 1.

Analysis of Total Variation Distance from Uniform Distribution
We now propose a different analysis that does not depend on the (unknown) underlying distribution p.Here, we consider the total variation (TV) between P and a uniform distribution.Formally, we define the empirical total variation (ETV) as This measure only depends on the sample and the alphabet size k.It can be evaluated in practice and can hence help one choose the preferred estimator in practical setups.Notice that in a large-alphabet regime, n ≤ k, the minimum ETV is obtained for uniform and empirical uniform distributions, ETV(P uni f orm , P) = 1 − n/k.On the other hand, its maximum value is achieved for a degenerate distribution, which leads to ETV(P uni f orm , P) = 1 − 1/k.In our study, we set a sample size of n = 1000 and an alphabet size of k = 10 4 , which leads to ETV ∈ [0.9, 1]. Figure 9 illustrates the ETV for the Zipf's Law and geometric distributions in the specified scenarios.The mean RMSE is evaluated at every distance for each estimator.As demonstrated in Figure 9a, the geometric distributions reveal three ranges for low ETV, mid-ETV, and high ETV.The low ETV falls below 0.91, the mid-ETV spans from 0.91 to 0.97, and the high ETV extends from 0.97 to the maximum distance of 1. Notably, NSB, SHR, and NJ yield the best results for low ETV values, while NJ excels in the mid-ETV range.The classical estimators surpass the others in the high-ETV range.These findings align with those noted in previous sections while also providing insight into their specific behaviors.Figure 9b presents the three ranges found in Zipf's Law distributions with similar boundaries.Another observation is that certain estimators, such as the polynomial, BUB, MIN, and NJ, exhibit increasing errors as the distance grows.These estimators do not perform competitively in the classical regime, as the distribution tends toward a deterministic end, and the classical estimators tend to yield better results.These insights are consistent with the previous analyses in Sections 5.1 and 5.2.The SHR estimator shows distinct behavior, initially presenting the best outcome within the low-ETV range, as shown in previous studies [17].However, it is soon overtaken by other estimators as the distribution distance for the uniform distribution increases, as its design better suits the low-and high-ETV ranges.Finally, the sharp improvement in the MIN estimator can be attributed to its prior matching the geometric distribution within this range.This reaffirms the claim made in previous works that claim that the selection of the prior significantly impacts the performance of Bayesian estimators [9].

Choosing the Most Favorable Entropy Estimator
Based on the analysis above, we can draw the following conclusions.The top estimators for each range are ranked based on their performance, with the first one being the most favorable.If the ETV distance falls in the low-ETV range, NJ, SHR, NSB, and CS are the most suitable.It is worth mentioning that the three estimators NJ, NSB, and SHR present competitive estimations due to their ability to handle distributions similar to a uniform one, as outlined in the preceding sections.
For an ETV distance within the mid-ETV range, the recommendation is to use the NJ estimator, followed by BUB, unseen, CDM, and NSB.NJ stands out as the overall best in this range, a finding that is consistent with previous sections and its capacity to generalize in the large-alphabet regime [18].NSB, with its mixed Dirichlet priors, also demonstrates strong performance due to its ability to capture various distributions with its broadly defined prior [11].
Finally, for high ETV values, the traditional estimators converge and produce the best results.The top performers are the PYM, jackknife, unseen, CDM, and NSB estimators.These results align with previous sections.
In summary, the proposed selection method based on the analysis presented in the previous sections is specifically engineered to achieve the most favorable entropy estimation in the large-alphabet regime (Figure 10).Interestingly, the NSB estimator consistently performs well across all ranges and can be a good starting point when estimating an unknown distribution.Despite NJ outperforming the other two estimators in the lowand mid-ETV ranges, its long computation time lessens its appeal as a first choice.It is important to emphasize that the ETV cut-off ranges depend on the choice of n and k.Thus, in order to decide whether an ETV value is low, medium, or high, one needs to create similar plots for different values of n and k.

Real-World Experiments
Let us study two real-world applications.In the first experiment, we studied English word frequencies.The English word frequency list describes the frequency at which each word appears in a language, based on hundreds of millions of words, collected from opensource subtitles (www.opensubtitles.org,accessed on 14 April 2024) or based on different dictionaries and glossaries (http://en.wiktionary.org/wiki/Wiktionary:Frequency_lists,accessed on 14 April 2024).This results in an alphabet size of approximately 500K words.Our goal was to estimate the entropy of the English word frequency list based on a sample of n independent observations from it.Next, we studied the Dow Jones Index (DJI), which demonstrates a time series setting.The DJI dataset contains the daily closing prices of 30 large companies on the U.S. stock exchange [18].Here, our goal was to estimate the marginal (stationary) entropy of the DJI.For this purpose, we focused on a relatively stationary time period between the years 1990 and 1997 (see Section 5.E. of [18]).The DJI closing values on each day were taken, and the frequency for each value was calculated.The resulting distribution consists of approximately 1600 values, where each symbol represents a unique closing value.Notice that, despite its ordinal nature, we treated each symbol as categorical for the purpose of this experiment.
For each of the datasets above, we drew n samples with replacement.The "true" entropy was evaluated from the empirical distribution of the entire dataset, and this entropy was compared to the estimated entropy derived from the samples.This procedure was repeated one hundred times, and the RMSE was calculated for each estimator and sample size.We note that, in both experiments, the underlying distributions (based on the entire datasets) are within the mid-uniform range.
We evaluate our conclusions from Section 5.1 as we focus on entropy estimators that are representative of each distribution range.For the English word dataset, Figure 11a shows that NJ, SHR, jackknife, and NSB attain the best results.These findings align with the conclusions in Table 1.The DJI dataset results presented in Figure 11b indicate that NSB, NJ, and SHR have the most competitive performance, further reinforcing previous conclusions.Overall, the analysis of the real-world datasets aligns with our previous findings in Section 5.1, showing slight differences in the performance of the estimators but maintaining the same general trend of RMSE improvement with the increase in the sample size.In both datasets, NJ, SHR, and NSB provide competitive results, consistent with the top performers in the mid-uniform range.

Discussion
This research compares twenty-one entropy estimators, including novel neural networkbased and polynomial approximate entropy estimators.It focuses on the large-alphabet regime across a variety of distributions from uniform to deterministic, extending the comparative study of [17].In the analysis, we distinguished between three different distribution ranges, namely, near-, mid-, and far-uniform.Further, we studied the low-, mid-, and high-ETV distances.Our findings indicate that the NJ, NSB, and SHR estimators yield the most favorable results in the near-uniform and low-ETV ranges, as evidenced by the SHR estimator, which also aligns with earlier studies, including [9,17].
From the findings in the near-uniform range, the Bayesian estimator's performance is highly reliant on the prior, suggesting a future research direction for optimizing the prior of more traditional Bayesian estimators such as SG and MIN to suit the specific problem settings.The NJ, NSB, unseen, and CWJ estimators stand out in the mid-uniform range, with NJ outperforming the other three.These findings are consistent with the mid-ETV range.
However, as the distribution shifts toward the far-uniform range, NJ, a neural networkbased estimator, begins to under-perform and overgeneralize.Future research could explore this challenge of the NJ estimator, addressing its high bias and lack of convergence as the distribution becomes more deterministic.Interestingly, increasing the network size does not significantly impact the estimator's performance, indicating that improvements need to be pursued through alternative strategies.
In the far-uniform as well as in the high-ETV range, classical estimators tend to converge, and the top performers include PYM, CDM, and NSB.Notably, NSB yields better outcomes in the high range and mid-range.Future studies could delve further into this estimator, conducting a more detailed comparison to outline the advantages and disadvantages of this estimator in the large-alphabet regime.
The bias analysis reveals that the polynomial estimator exhibits the largest bias, presenting another potential direction for research to investigate the intrinsic bias of this estimator.This estimator is designed for the large-alphabet regime, yet improper polynomial approximation results in significant under-performance.
Tying it all together, given an unknown distribution, a generalized assessment of the estimator choice can be performed based on the distribution range.In the low-ETV regime, estimators with a strong generalization ability are likely to produce the best results.These estimators exhibit low bias by correcting it according to the distributions.For mid-ETV values, the most favorable estimators can generalize and adapt to different types of distributions.They are not based on the plug-in estimator but instead present an alternative computation method to entropy estimation.Such methods include those based on neural network or linear programming.For a high ETV, estimators based on the plug-in method tend to converge and deliver the best performance.This range is characterized by a high ratio of the sample to alphabet size.  c 1 log 2 k) m−1 (y i ) m + (log 2 n c 1 log 2 k )y i ), where L is the polynomial degree, L = ⌊c 0 log 2 k⌋, a m is an approximate polynomial coefficient, ϕ(x) = −xlog 2 x, y i , y ′ i denotes the observed frequencies of the equally split sample, and c 0 , c 1 , c 2 > 0 are constants specified in [19].
unseen [13] unseen Algorithmic calculation based on linear programming.
Neural joint entropy estimator [18] NJ Neural network estimator based on minimizing the cross-entropy loss.

Figure 1 .
Figure 1.RMSE plots for multiple sample sizes and a large alphabet size of 10 5 for a uniform distribution (a), geometric distribution with p = 10 −5 (b), and Zipf's Law distribution with α = 0.001 (c).The y-axis representing the RMSE is on a log scale to better differentiate between the estimators.

Figure 2 .
Figure 2. RMSE plots for multiple sample sizes and a large alphabet size of 10 5 on a geometric distribution with p = 0.001 (a) and Zipf's Law distribution with α = 1 (b).

Figure 3 .
Figure 3. RMSE plots for multiple sample sizes and a large alphabet size of 10 5 on a geometric distribution with p = 0.9 (a) and Zipf's Law distribution with α = 3 (b).

Figure 4 .
Figure 4. RMSE plots for multiple alphabet sizes with a constant sample size of 1000 for a uniform distribution (a), geometric distribution with p = 0.00001 (b), and Zipf's Law distribution with α = 0.001 (c).

Figure 5 .
Figure 5. RMSE plots for multiple alphabet sizes with a constant sample size of 1000 on a geometric distribution with p = 0.001 (a) and Zipf's Law distribution with α = 0.4 (b).

Figure 6 .
Figure 6.RMSE plots for multiple alphabet sizes with a constant sample size of 1000 on a geometric distribution with p = 0.5 (a) and Zipf's Law distribution with α = 2 (b).
e e n Ja c k k n if e M IN P lu g -in P o ly n o m ia l S H R Z h a n s e e n Ja c k k n if e M IN P lu g -in P o ly n o m ia l S H R Z h a e e n Ja c k k n if e M IN P lu g -in P o ly n o m ia l S H R Z h a n

Figure 7 .
Figure 7.The bias in bits across all entropy estimators for uniform (a), geometric (b), and Zipf's Law (c) distributions with an alphabet size of 10 5 and varying sample sizes.

Figure 8 .
Figure 8. Geometric and Zipf's Law distributions' mean RMSEs of estimators at varying sample sizes and a fixed alphabet size of 10 5 for each respective distribution parameter, with geometric p in (a) and Zipf's Law α in (b).

Figure 9 .
Figure 9. RMSE plots of estimators for the geometric and Zipf's Law distributions with a sample size of n = 1000 and an alphabet size of k = 10 4 .For each drawn sample, we evaluated its ETV and computed its corresponding mean RMSE for different entropy estimators.The geometric distributions are presented in (a), while the Zipf's Law distributions are in (b).

Figure 10 .
Figure 10.A decision tree for selecting the most effective entropy estimator for an unknown distribution in the large-alphabet regime.

Figure 11 .
Figure 11.The mean RMSE of selected estimators for the English word and DJI datasets.The English word dataset is presented in (a), while the DJI dataset is in (b).

Table 1 .
Best entropy estimators for each range, ordered by performance.